Study of Chinese Text Similarity Based on Difference Factor in Word-Number
Abstract
Text similarity calculation is basic work in Chinese information processing applications. A high-quality text similarity calculation method must be both accurate and efficient. That is, it should compare texts at the level of natural-language meaning and, based on a full understanding of the semantics of the author or text source, arrive at similarity judgments close to those of a human reader. At the same time, it should be efficient enough to save processing time when facing large amounts of text. After reviewing the domestic and foreign literature and analyzing the current state of similarity calculation, this paper presents a new method intended to improve the performance of similarity calculation: a Chinese text similarity algorithm based on word-number difference. The algorithm combines the traditional statistics-based method with the narrow semantic method, that is, it combines statistical efficiency with semantic accuracy. Combining the advantages of the statistical and semantic categories also means having to face and overcome the disadvantages of both kinds of methods. This paper takes the difference in word-number as the breakthrough point and exploits the diversity of Chinese word-number, combined with word frequency, number, and meaning, in order to extend word similarity calculation to text similarity calculation. Finally, a self-built small text set is used as the test object to compare the similarity calculations of different methods in a laboratory environment. The results show that the similarity calculation method based on difference in word-number performs better than the traditional statistical and semantic methods. Through manual comparison of the test results in terms of accuracy and speed of segmentation, this work provides a new approach for Chinese text similarity calculation.
Similar resources
Keyword Extraction From Chinese Text Based On Multidimensional Weighted Features
This paper proposed to solve the problems of incomplete coverage and low accuracy in keyword extraction of Chinese text based on intrinsic feature of the Chinese language and an extraction method of multidimensional information weighted eigenvalues. This method combined theoretical analysis and experimental calculation to study the parts of speech, word position, word length, semantic similarit...
The Research of Chinese Words Semantic Similarity Calculation with Multi-Information
Text similarity has a relatively wide range of applications in many fields, such as intelligent information retrieval, question answering system, text rechecking, machine translation, and so on. The text similarity computing based on the meaning has been used more widely in the similarity computing of the words and phrase. Using the knowledge structure of the and its method of knowledg...
The Effect of Pictorial Flashcards on the Sight Word Recognition in Kindergartens
It was a quasi-experimental study because it involved training participants in two classes of pre-primary students about 5 to 6 years old. To this end, fifty students who were studying at Misagh School in Tabriz participated in the study. In order to make sure of their homogeneity, the researcher administered a pre-test. Based on the results, 40 students were selected as the ...
A Component Histogram Map Based Text Similarity Detection Algorithm
Conventional text similarity detection usually uses word frequency vectors to represent texts, but this representation is high-dimensional and sparse. So in this research, a new text similarity detection algorithm using a component histogram map (CHM-TSD) is proposed. This method is based on the mathematical expression of Chinese characters, with which Chinese characters can be split into components. Then each ...
The Dependence of Frequency Distributions on Multiple Meanings of Words, Codes and Signs
The dependence of the frequency distributions due to multiple meanings of words in a text is investigated by deleting letters. By coding the words with fewer letters the number of meanings per coded word increases. This increase is measured and used as an input in a predictive theory. For a text written in English, the word-frequency distribution is broad and fat-tailed, whereas if the words ar...
Journal: Journal of Multimedia
Volume: 9
Publication date: 2014